Abstract

These notes describe the most efficient hash functions currently known for hashing integers
and strings. These modern hash functions are often an order of magnitude faster than those
presented in standard textbooks. They are also simpler to implement, and hence a clear win
in practice, but their analysis is harder. Some of the most practical hash functions have only
appeared in theory papers, and some of them require combining results from different theory
papers. The goal here is to combine the information in lecture-style notes that can be used by
theoreticians and practitioners alike, thus making these practical fruits of theory more widely
accessible.


1 Hash functions 

The concept of truly independent hash functions is extremely useful in the design of randomized
algorithms. We have a large universe $U$ of keys, e.g., 64-bit numbers, that we wish to map
randomly to a range $[m] = \{0, \ldots, m-1\}$ of hash values. A truly random hash function $h : U \to [m]$
assigns an independent uniformly random variable $h(x)$ to each key $x \in U$. The function $h$ is thus a
$|U|$-dimensional random variable, picked uniformly at random among all functions from $U$ to $[m]$.

Unfortunately truly random hash functions are idealized objects that cannot be implemented.
More precisely, to represent a truly random hash function, we need to store at least $|U| \log_2 m$ bits,
and in most applications of hash functions, the whole point in hashing is that the universe is much
too large for such a representation (at least not in fast internal memory).

The idea is to let hash functions contain only a small element or seed of randomness so that
the hash function is sufficiently random for the desired application, yet so that the seed is small
enough that we can store it once it is fixed. As an example, if $p$ is prime, a random hash
function $h : [p] \to [p] = \{0, \ldots, p-1\}$ could be $h(x) = (ax + b) \bmod p$ where $a$ and $b$ are random
numbers that together form the random seed describing the function. In these notes we will discuss
some basic forms of random hashing that are very efficient to implement, and yet have sufficient
randomness for some important applications.
1.1 Definition and properties

Definition 1 A hash function $h : U \to [m]$ is a random variable in the class of all functions
$U \to [m]$, that is, it consists of a random variable $h(x)$ for each $x \in U$.

For a hash function, we care about roughly three things: 

Space The size of the random seed that is necessary to calculate $h(x)$ given $x$.

Speed The time it takes to calculate $h(x)$ given $x$.

Properties of the random variable. 

In the next sections we will mention different desirable properties of the random variable and 
describe how to obtain them efficiently in terms of space and speed. 


2 Universal hashing 

The concept of universal hashing was introduced by Carter and Wegman in [2]. We wish to
generate a random hash function $h : U \to [m]$ from a key universe $U$ to a set of hash values
$[m] = \{0, \ldots, m-1\}$. We think of $h$ as a random variable following some distribution over
functions $U \to [m]$. We want $h$ to be universal which means that for any given distinct keys $x, y \in U$,
when $h$ is picked at random (independently of $x$ and $y$), we have low collision probability:

$$\Pr_h[h(x) = h(y)] \le 1/m.$$

For many applications, it suffices if for some $c = O(1)$, we have

$$\Pr_h[h(x) = h(y)] \le c/m.$$

Then $h$ is called $c$-universal.

In this chapter we will first give some concrete applications of universal hashing. Next we will
show how to implement universal hashing when the key universe is an integer domain
$U = [u] = \{0, \ldots, u-1\}$ where the integers fit in a machine word, that is, $u \le 2^w$ where $w = 64$ is the word
length. In later chapters we will show how to make efficient universal hashing for large objects
such as vectors and variable length strings.

Exercise 2.1 Is the truly independent hash function $h : U \to [m]$ universal?

Exercise 2.2 If a hash function $h : U \to [m]$ has collision probability 0, how large must $m$ be?

Exercise 2.3 Let $u \le m$. Is the identity function $f(x) = x$ a universal hash function $[u] \to [m]$?
2.1 Applications

One of the most classic applications of universal hashing is hash tables with chaining. We have a
set $S \subseteq U$ of keys that we wish to store so that we can find any key from $S$ in expected constant
time. Let $n = |S|$ and $m \ge n$. We now pick a universal hash function $h : U \to [m]$, and then
create an array $L$ of $m$ lists/chains so that for $i \in [m]$, $L[i]$ is the list of keys that hash to $i$. Now
to find out if a key $x \in U$ is in $S$, we only have to check if $x$ is in the list $L[h(x)]$. This takes time
proportional to $1 + |L[h(x)]|$ (we add 1 because it takes constant time to look up the list even if
it turns out to be empty).
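
The lookup described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: keys are 64-bit integers, the table has $m = 2^l$ chains, and the hash is the multiply-shift scheme of Section 2.3 (which is 2-universal rather than universal). The names `table_t`, `table_init`, `table_contains` and `table_insert` are hypothetical, and the seed is assumed to come from a good source of randomness.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

/* One node in a chain L[i]. */
typedef struct node {
    uint64_t key;
    struct node *next;
} node_t;

/* Hash table with m = 2^l chains and a random odd seed a. */
typedef struct {
    node_t **chains;
    unsigned l;      /* number of hash bits; assumed small, 1 <= l <= 30 */
    uint64_t a;      /* random odd seed (assumption: supplied by the caller) */
} table_t;

/* Multiply-shift: the top l bits of a*x modulo 2^64. */
static uint64_t table_hash(const table_t *t, uint64_t x) {
    return (t->a * x) >> (64 - t->l);
}

/* Returns 0 on success, -1 on allocation failure. */
int table_init(table_t *t, unsigned l, uint64_t seed) {
    t->l = l;
    t->a = seed | 1u;                       /* force the seed to be odd */
    t->chains = calloc((size_t)1 << l, sizeof *t->chains);
    return t->chains ? 0 : -1;
}

/* Scans only the chain L[h(x)], so the time is 1 + |L[h(x)]|. */
bool table_contains(const table_t *t, uint64_t x) {
    for (const node_t *n = t->chains[table_hash(t, x)]; n; n = n->next)
        if (n->key == x) return true;
    return false;
}

/* Returns 1 if x was inserted, 0 if already present, -1 on allocation failure. */
int table_insert(table_t *t, uint64_t x) {
    if (table_contains(t, x)) return 0;
    node_t *n = malloc(sizeof *n);
    if (!n) return -1;
    uint64_t i = table_hash(t, x);
    n->key = x;
    n->next = t->chains[i];
    t->chains[i] = n;
    return 1;
}
```

Inserting at the head of the chain keeps insertion constant time once the membership scan is done; freeing the chains is omitted for brevity.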

Assume that $x \notin S$ and that $h$ is universal. Let $I(y)$ be an indicator variable which is 1 if
$h(x) = h(y)$ and 0 otherwise. Then the expected number of elements in $L[h(x)]$ is

$$\mathrm{E}_h\bigl[|L[h(x)]|\bigr] = \mathrm{E}_h\Bigl[\sum_{y \in S} I(y)\Bigr] = \sum_{y \in S} \mathrm{E}_h[I(y)] = \sum_{y \in S} \Pr_h[h(y) = h(x)] \le n/m \le 1.$$

The second equality uses linearity of expectation.

Exercise 2.4 (a) What is the expected number of elements in $L[h(x)]$ if $x \in S$?

(b) What bound do you get if $h$ is only 2-universal?

The idea of hash tables goes back to [?], and hash tables were the prime motivation for the introduction
of universal hashing in [2]. For a textbook description, see, e.g., [3, §11.2].

A different application is that of assigning a unique signature $s(x)$ to each key. Thus we
want $s(x) \ne s(y)$ for all distinct keys $x, y \in S$. To get this, we pick a universal hash function
$s : U \to [n^3]$. The probability of an error (collision) is calculated as

$$\Pr_s\bigl[\exists \{x, y\} \subseteq S : s(x) = s(y)\bigr] \le \sum_{\{x, y\} \subseteq S} \Pr_s[s(x) = s(y)] \le \binom{n}{2} / n^3 < 1/(2n).$$

The first inequality is a union bound: the probability that at least one of multiple events
happens is at most the sum of their probabilities.

The idea of signatures is particularly relevant when the keys are large, e.g., a key could be a
whole text document, which then becomes identified by the small signature. This idea could also
be used in connection with hash tables, letting the list $L[i]$ store the signatures $s(x)$ of the keys that
hash to $i$, that is, $L[i] = \{s(x) \mid x \in S, h(x) = i\}$. To check if $x$ is in the table we check if $s(x)$ is
in $L[h(x)]$.

Exercise 2.5 With $s : U \to [n^3]$ and $h : U \to [n]$ independent universal hash functions, for a
given $x \in U \setminus S$, what is the probability of a false positive when we search $x$, that is, what is the
probability that there is a key $y \in S$ such that $h(y) = h(x)$ and $s(y) = s(x)$?

Below we study implementations of universal hashing. 
2.2 Multiply-mod-prime

Note that if $m \ge u$, we can just let $h$ be the identity (no randomness needed) so we may assume
that $m < u$.

The classic universal hash function from [2] is based on a prime number $p \ge u$. We pick a
uniformly random $a \in [p]_+ = \{1, \ldots, p-1\}$ and $b \in [p] = \{0, \ldots, p-1\}$, and define
$h_{a,b} : [u] \to [m]$ by

$$h_{a,b}(x) = ((ax + b) \bmod p) \bmod m. \tag{1}$$
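
As a concrete instance of (1), the sketch below fixes the prime $p = 2^{31} - 1$ so that all intermediate values fit in 64 bits. That choice of prime, and the name `mult_mod_prime`, are assumptions made for illustration only; keys must then satisfy $x < p$.

```c
#include <stdint.h>

/* Assumption for this sketch: p = 2^31 - 1, which is prime. */
#define MMP_PRIME 2147483647ULL

/* h_{a,b}(x) = ((a*x + b) mod p) mod m.
   Requires 1 <= a < p, 0 <= b < p, 0 <= x < p and m >= 1. */
uint64_t mult_mod_prime(uint64_t x, uint64_t a, uint64_t b, uint64_t m) {
    /* a*x + b <= (p-1)*(p-1) + (p-1) < p*p < 2^62, so nothing overflows. */
    return ((a * x + b) % MMP_PRIME) % m;
}
```

The seeds `a` and `b` would be drawn once, uniformly at random from their ranges, and then reused for every key.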

Given any distinct $x, y \in [u] \subseteq [p]$, we want to argue that for random $a$ and $b$,

$$\Pr_{a \in [p]_+,\, b \in [p]}[h_{a,b}(x) = h_{a,b}(y)] \le 1/m. \tag{2}$$

In most of our proof, we will consider all $a \in [p]$, including $a = 0$. Ruling out $a = 0$ will only be
used in the end to get the tight bound from (2).

We need only one basic fact about primes:

Fact 2.1 If $p$ is prime and $\alpha, \beta \in [p]_+$ then $\alpha\beta \not\equiv 0 \pmod p$.

Let $x, y \in [p]$, $x \ne y$ be given. For a given pair $(a, b) \in [p]^2$, define $(q, r) \in [p]^2$ by

$$(ax + b) \bmod p = q \tag{3}$$

$$(ay + b) \bmod p = r. \tag{4}$$


Lemma 2.2 Equations (3) and (4) define a 1-1 correspondence between pairs $(a, b) \in [p]^2$ and
pairs $(q, r) \in [p]^2$.

Proof For a given pair $(q, r) \in [p]^2$, we will show that there is at most one pair $(a, b) \in [p]^2$
satisfying (3) and (4). Subtracting (3) from (4) modulo $p$, we get

$$(ay + b) - (ax + b) \equiv a(y - x) \equiv r - q \pmod p. \tag{5}$$

We claim that there is at most one $a$ satisfying (5). Suppose there is another $a'$ satisfying (5).
Subtracting the equations with $a$ and $a'$, we get

$$(a - a')(y - x) \equiv 0 \pmod p,$$

but since $a - a'$ and $y - x$ are both non-zero modulo $p$, this contradicts Fact 2.1. There is thus at
most one $a$ satisfying (5) for given $(q, r)$. With this $a$, we need $b$ to satisfy (3), and this determines
$b$ as

$$b = (q - ax) \bmod p. \tag{6}$$

Thus, for each pair $(q, r) \in [p]^2$, there is at most one pair $(a, b) \in [p]^2$ satisfying (3) and (4). On
the other hand, (3) and (4) define a unique pair $(q, r) \in [p]^2$ for each pair $(a, b) \in [p]^2$. We have $p^2$
pairs of each kind, so the correspondence must be 1-1. ■

Since $x \ne y$, by Fact 2.1,

$$r = q \iff a = 0. \tag{7}$$

Thus, when we pick $(a, b) \in [p]_+ \times [p]$, we get $r \ne q$.

Returning to the proof of (2), we get a collision $h_{a,b}(x) = h_{a,b}(y)$ if and only if $q \equiv r \pmod m$,
and we know that $q \ne r$. For given $r \in [p]$, there are at most $\lceil p/m \rceil$ values of $q \in [p]$ with
$q \equiv r \pmod m$; namely $r', r' + m, r' + 2m, \ldots < p$ for $r' = r \bmod m$. Ruling out $q = r$ leaves us
at most $\lceil p/m \rceil - 1$ values of $q \in [p]$ for each of the $p$ values of $r$. Noting that
$\lceil p/m \rceil - 1 \le (p + m - 1)/m - 1 = (p - 1)/m$, we get that the total number of collision pairs $(r, q)$, $r \ne q$, is
bounded by $p(p - 1)/m$. Then our 1-1 correspondence implies that there are at most $p(p - 1)/m$
collision pairs $(a, b) \in [p]_+ \times [p]$. Since each of the $p(p - 1)$ pairs from $[p]_+ \times [p]$ is equally likely,
we conclude that the collision probability is bounded by $1/m$, as required for universality.

Exercise 2.6 Suppose we for our hash function also consider $a = 0$, that is, for random
$(a, b) \in [p]^2$, we define the hash function $h_{a,b} : [p] \to [m]$ by

$$h_{a,b}(x) = ((ax + b) \bmod p) \bmod m.$$

(a) Show that this function may not be universal.

(b) Prove that it is always 2-universal, that is, for any distinct $x, y \in [p]$,

$$\Pr_{(a,b) \in [p]^2}[h_{a,b}(x) = h_{a,b}(y)] < 2/m.$$

2.2.1 Implementation for 64-bit keys 

Let us now consider the implementation of our hashing scheme

$$h(x) = ((ax) \bmod p) \bmod m$$

for the typical case of 64-bit keys in a standard imperative programming language such as C. Let's
say the hash values are 20 bits, so we have $u = 2^{64}$ and $m = 2^{20}$.

Since $p > u = 2^{64}$, we generally need to reserve more than 64 bits for $a \in [p]_+$, so the product
$ax$ has more than 128 bits. To compute $ax$, we now have the issue that multiplication of $w$-bit
numbers automatically discards overflow, returning only the $w$ least significant bits of the product.
However, we can get the product of 32-bit numbers, representing them as 64-bit numbers, and
getting the full 64-bit result. We need at least 6 such 64-bit multiplications to compute $ax$.

The next issue is how to compute $ax \bmod p$. For 64-bit numbers we have a general mod-operation,
though it is rather slow, and here we have more than 128 bits.

An idea from [?] is to let $p$ be a Mersenne prime, that is, a prime of the form $2^q - 1$. We can
here use the Mersenne prime $2^{89} - 1$. The point in using a Mersenne prime is that

$$x \equiv x \bmod 2^q + \lfloor x / 2^q \rfloor \pmod p.$$
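
The identity above replaces a general mod-operation by a mask, a shift and an addition. The sketch below illustrates it for the smaller Mersenne prime $2^{61} - 1$, an assumption made here so that every value fits in one 64-bit word; the text's $2^{89} - 1$ needs multi-word arithmetic, which is not shown. The name `mod_mersenne61` is hypothetical.

```c
#include <stdint.h>

#define MERSENNE_Q 61
#define MERSENNE_P ((UINT64_C(1) << MERSENNE_Q) - 1)   /* 2^61 - 1 */

/* Returns x mod (2^61 - 1) for any 64-bit x, without a division. */
uint64_t mod_mersenne61(uint64_t x) {
    /* x = (x mod 2^q) + floor(x / 2^q) (mod p): mask plus shift. */
    uint64_t y = (x & MERSENNE_P) + (x >> MERSENNE_Q);
    /* Here y <= p + 7 < 2p, so one conditional subtraction suffices. */
    if (y >= MERSENNE_P) y -= MERSENNE_P;
    return y;
}
```

For inputs wider than 64 bits, the same fold would be applied to the high and low parts of the product, possibly more than once.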
Exercise 2.7 Assuming a language like C supporting 64-bit multiplication, addition, shifts and
bit-wise Boolean operations, but no general mod-operation, sketch the code to compute
$((ax + b) \bmod p) \bmod m$ with $p = 2^{89} - 1$ and $m = 2^{20}$.


2.3 Multiply-shift 

We shall now turn to a truly practical universal hashing scheme proposed by Dietzfelbinger et
al. [?], yet ignored by most textbooks. It generally addresses hashing from $w$-bit integers to $\ell$-bit
integers. We pick a uniformly random odd $w$-bit integer $a$, and then we compute $h_a : [2^w] \to [2^\ell]$
as

$$h_a(x) = \lfloor (ax \bmod 2^w) / 2^{w-\ell} \rfloor. \tag{8}$$

This scheme gains an order of magnitude in speed over the scheme from (1), exploiting operations
that are fast on standard computers. Numbers are stored as bit strings, with the least significant bit
to the right. Integer division by a power of two is thus accomplished by a right shift. For hashing
64-bit integers, we further exploit that 64-bit multiplication automatically discards overflow, which
is the same as multiplying modulo $2^{64}$. Thus, with $w = 64$, we end up with the following C-code:

#include <stdint.h> // defines uint64_t as unsigned 64-bit integer.
uint64_t hash(uint64_t x, uint64_t l, uint64_t a) {
    // hashes x universally into l bits using the random odd seed a.
    // Requires 1 <= l <= 64: shifting a 64-bit value by 64 is undefined in C.
    return (a*x) >> (64-l);
}


This scheme is many times faster and simpler to implement than the standard multiply-mod-prime
scheme, but the analysis is more subtle.
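
A small usage sketch may help here. It assumes the caller already has a uniformly random 64-bit word from some trusted source (not shown) and wants a bucket index in a table with $2^{20}$ entries; the helper names `make_odd_seed` and `bucket20` are hypothetical.

```c
#include <stdint.h>

/* Multiply-shift as in (8) with w = 64: the top l bits of a*x mod 2^64.
   Requires 1 <= l <= 64. */
static uint64_t multiply_shift(uint64_t x, unsigned l, uint64_t a) {
    return (a * x) >> (64 - l);
}

/* Turns a uniformly random 64-bit word into a uniformly random odd seed:
   each odd value is the image of exactly two inputs. */
uint64_t make_odd_seed(uint64_t random_word) {
    return random_word | 1u;
}

/* Bucket index for a table with 2^20 entries. */
uint32_t bucket20(uint64_t key, uint64_t seed) {
    return (uint32_t)multiply_shift(key, 20, seed);
}
```

The seed is fixed once per table; changing it means rehashing every stored key.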

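It is convenient to think of the bits of a number as indexed with bit 0 the least significant bit.
The scheme is simply extracting bits $w-\ell, \ldots, w-1$ from the product $ax$, as illustrated below.

[Figure: the product $ax$ as a bit string with positions $w-1$, $w-\ell$ and $0$ marked; (a*x)>>(w-l) keeps bits $w-\ell, \ldots, w-1$.]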


We will prove that multiply-shift is 2-universal, that is, for $x \ne y$,

$$\Pr_{\text{odd } a \in [2^w]}[h_a(x) = h_a(y)] \le 2/2^\ell = 2/m. \tag{9}$$

We have $h_a(x) = h_a(y)$ if and only if $ax$ and $ay = ax + a(y - x)$ agree on bits $w-\ell, \ldots, w-1$.
This match requires that bits $w-\ell, \ldots, w-1$ of $a(y - x)$ are either all 0s or all 1s. More precisely,
if we get no carry from bits $0, \ldots, w-\ell-1$ when we add $a(y - x)$ to $ax$, then $h_a(x) = h_a(y)$
exactly when bits $w-\ell, \ldots, w-1$ of $a(y - x)$ are all 0s. On the other hand, if we get a carry
1 from bits $0, \ldots, w-\ell-1$ when we add $a(y - x)$ to $ax$, then $h_a(x) = h_a(y)$ exactly when bits
$w-\ell, \ldots, w-1$ of $a(y - x)$ are all 1s. To prove (9), it thus suffices to prove that the probability
that bits $w-\ell, \ldots, w-1$ of $a(y - x)$ are all 0s or all 1s is at most $2/2^\ell$.

We will exploit that any odd number $z$ is relatively prime to any power of two:

Fact 2.3 If $\alpha$ is odd and $\beta \in [2^q]_+$ then $\alpha\beta \not\equiv 0 \pmod{2^q}$.
Define $b$ such that $a = 1 + 2b$. Then $b$ is uniformly distributed in $[2^{w-1}]$. Moreover, define $z$ to be
the odd number satisfying $(y - x) = z 2^i$. Then

$$a(y - x) = z 2^i + b z 2^{i+1}.$$

[Figure: $a(y - x)$ drawn as the sum of the bit strings $z 2^i$ and $b z 2^{i+1}$, with position $w-1$ marked.]
Now, we prove that $bz \bmod 2^{w-1}$ must be uniformly distributed in $[2^{w-1}]$. First, note that there
is a 1-1 correspondence between the $b \in [2^{w-1}]$ and the products $bz \bmod 2^{w-1}$; for if there were
another $b' \in [2^{w-1}]$ with $b'z \equiv bz \pmod{2^{w-1}} \iff z(b' - b) \equiv 0 \pmod{2^{w-1}}$, then this would
contradict Fact 2.3 since $z$ is odd. But then the uniform distribution on $b$ implies that $bz \bmod 2^{w-1}$
is uniformly distributed. We conclude that $a(y - x) = z 2^i + b z 2^{i+1}$ has 0 in bits $0, \ldots, i-1$, 1 in
bit $i$, and a uniform distribution on bits $i+1, \ldots, i+w-1$.

We have a collision $h_a(x) = h_a(y)$ if $ax$ and $ay = ax + a(y - x)$ are identical on bits
$w-\ell, \ldots, w-1$. The two are always different in bit $i$, so if $i \ge w-\ell$, we have $h_a(x) \ne h_a(y)$
regardless of $a$. However, if $i < w-\ell$, then because of carries, we could have $h_a(x) = h_a(y)$ if bits
$w-\ell, \ldots, w-1$ of $a(y - x)$ are either all 0s, or all 1s. Because of the uniform distribution, either
event happens with probability $1/2^\ell$, for a combined probability bounded by $2/2^\ell$. This completes
the proof of (9).
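
For small word sizes the bound (9) can be checked exhaustively. The sketch below is only a sanity check under assumed toy parameters (for example $w = 8$, $\ell = 3$), not part of the proof; `ms_small` and `max_colliding_seeds` are hypothetical names.

```c
#include <stdint.h>

/* Multiply-shift on w-bit words (w <= 16 here): bits w-l, ..., w-1 of a*x. */
static uint32_t ms_small(uint32_t x, uint32_t a, unsigned w, unsigned l) {
    uint32_t mask = (1u << w) - 1u;
    return ((a * x) & mask) >> (w - l);
}

/* Over all pairs of distinct keys in [2^w], returns the largest number of
   odd seeds a in [2^w] under which the pair collides. There are 2^(w-1)
   odd seeds, so the bound (9) says the result is at most
   (2 / 2^l) * 2^(w-1) = 2^(w-l). */
uint32_t max_colliding_seeds(unsigned w, unsigned l) {
    uint32_t n = 1u << w, worst = 0;
    for (uint32_t x = 0; x < n; x++)
        for (uint32_t y = x + 1; y < n; y++) {
            uint32_t count = 0;
            for (uint32_t a = 1; a < n; a += 2)
                if (ms_small(x, a, w, l) == ms_small(y, a, w, l)) count++;
            if (count > worst) worst = count;
        }
    return worst;
}
```

The triple loop costs about $2^{3w-2}$ hash evaluations, so this is only practical for small $w$.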

Exercise 2.8 Why is it important that $a$ is odd? Hint: consider the case where $x$ and $y$ differ only
in the most significant bit.

Exercise 2.9 Does there exist a key $x$ such that $h_a(x)$ is the same regardless of the random odd
number $a$? If so, can you come up with a real-life application where this is a disadvantage?


3 Strong universality 

We will now consider strong universality [?]. For $h : [u] \to [m]$, we consider pair-wise events of
the form that for given distinct keys $x, y \in [u]$ and possibly non-distinct hash values $q, r \in [m]$, we
have $h(x) = q$ and $h(y) = r$. We say a random hash function $h : [u] \to [m]$ is strongly universal
if the probability of every pair-wise event is $1/m^2$. We note that if $h$ is strongly universal, it is also
universal since

$$\Pr[h(x) = h(y)] = \sum_{q \in [m]} \Pr[h(x) = q \wedge h(y) = q] = m/m^2 = 1/m.$$
Observation 3.1 An equivalent definition of strong universality is that each key is hashed
uniformly into $[m]$, and that every two distinct keys are hashed independently.

Proof First assume strong universality and consider distinct keys $x, y \in U$. For any hash value
$q \in [m]$, $\Pr[h(x) = q] = \sum_{r \in [m]} \Pr[h(x) = q \wedge h(y) = r] = m/m^2 = 1/m$, so $h(x)$ is uniform
in $[m]$, and the same holds for $h(y)$. Moreover, for any hash value $r \in [m]$,

$$\Pr[h(x) = q \mid h(y) = r] = \Pr[h(x) = q \wedge h(y) = r] / \Pr[h(y) = r] = (1/m^2)/(1/m) = 1/m = \Pr[h(x) = q],$$

so $h(x)$ is independent of $h(y)$. For the converse direction, when $h(x)$ and $h(y)$ are independent,
$\Pr[h(x) = q \wedge h(y) = r] = \Pr[h(x) = q] \cdot \Pr[h(y) = r]$, and when $h(x)$ and $h(y)$ are uniform,
$\Pr[h(x) = q] = \Pr[h(y) = r] = 1/m$, so $\Pr[h(x) = q] \cdot \Pr[h(y) = r] = 1/m^2$. ■

Emphasizing the independence, strong universality is also called 2-independence, as it concerns a 
pair of two events. 

Exercise 3.1 Generalize 2-independence. What is 3-independence? k-independence? 

As for universality, we may accept some relaxed notion of strong universality. 

Definition 2 We say a random hash function $h : U \to [m]$ is strongly $c$-universal if

1. $h$ is $c$-uniform, meaning for every $x \in U$ and for every hash value $q \in [m]$, we have
$\Pr[h(x) = q] \le c/m$, and

2. every pair of distinct keys hash independently.

Exercise 3.2 If $h$ is strongly $c$-universal, what is the pairwise event probability,

$$\Pr[h(x) = q \wedge h(y) = r]?$$

Exercise 3.3 Argue that if $h : U \to [m]$ is strongly $c$-universal, then $h$ is also $c$-universal.

Exercise 3.4 Is Multiply-Shift strongly $c$-universal for any constant $c$?

3.1 Applications 

One very important application of strongly universal hashing is coordinated sampling, which is
crucial to the handling of Big Data and machine learning. The basic idea is that, based on small
samples, we can reason about the similarity of huge sets, e.g., how much they have in common, or how
different they are.

First we consider the sampling from a single set $A \subseteq U$ using a strongly universal hash function
$h : U \to [m]$ and a threshold $t \in \{0, \ldots, m\}$. We now sample $x$ if $h(x) < t$, which by uniformity
happens with probability $t/m$ for any $x$. Let $S_{h,t}(A) = \{x \in A \mid h(x) < t\}$ denote the resulting
sample from $A$. Then, by linearity of expectation, $\mathrm{E}[|S_{h,t}(A)|] = |A| \cdot t/m$. Conversely, this means
that if we have $S_{h,t}(A)$, then we can estimate $|A|$ as $|S_{h,t}(A)| \cdot m/t$.

We note that the universality from Section 2 does not in general suffice for any kind of
sampling. If we, for example, take the multiplication-shift scheme from Section 2.3, then we always
have $h(0) = 0$, so 0 will always be sampled if we sample anything, that is, if $t > 0$.

The important application is, however, not the sampling from a single set, but rather the
sampling from different sets $B$ and $C$ so that we can later reason about the similarity, estimating the
sizes of their union $B \cup C$ and intersection $B \cap C$.

Suppose we for two different sets $B$ and $C$ have found the samples $S_{h,t}(B)$ and $S_{h,t}(C)$. Based
on these we can compute the sample of the union as the union of the samples, that is,
$S_{h,t}(B \cup C) = S_{h,t}(B) \cup S_{h,t}(C)$. Likewise, we can compute the sample of the intersection as
$S_{h,t}(B \cap C) = S_{h,t}(B) \cap S_{h,t}(C)$. We can then estimate the size of the union and intersection by multiplying the
corresponding sample sizes by $m/t$.
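
The sampling and estimation steps can be sketched in C as follows. The sketch assumes 32-bit keys and uses the strongly universal multiply-shift scheme of Section 3.3 with 64-bit seeds and $m = 2^{32}$; the function names are hypothetical, each input array is assumed to hold distinct keys, and the quadratic intersection count is for illustration only.

```c
#include <stdint.h>
#include <stddef.h>

/* Strongly universal multiply-shift with 64-bit seeds, 32-bit keys and
   32-bit hash values, so m = 2^32. The seeds a and b are shared by all
   parties that want coordinated samples. */
static uint32_t hash32(uint32_t x, uint64_t a, uint64_t b) {
    return (uint32_t)((a * x + b) >> 32);
}

/* Copies the keys x of set[0..n) with h(x) < t into out (capacity >= n)
   and returns the sample size |S_{h,t}(set)|. Here 0 <= t <= 2^32. */
size_t sample(const uint32_t *set, size_t n, uint64_t a, uint64_t b,
              uint64_t t, uint32_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (hash32(set[i], a, b) < t) out[k++] = set[i];
    return k;
}

/* Estimates the size of the underlying set as |sample| * m / t, for t > 0. */
double estimate_size(size_t sample_size, uint64_t t) {
    return (double)sample_size * 4294967296.0 / (double)t;
}

/* Number of keys common to two samples, i.e. the size of the sample of
   the intersection (quadratic scan; assumes distinct keys per sample). */
size_t common_count(const uint32_t *s1, size_t n1,
                    const uint32_t *s2, size_t n2) {
    size_t c = 0;
    for (size_t i = 0; i < n1; i++)
        for (size_t j = 0; j < n2; j++)
            if (s1[i] == s2[j]) { c++; break; }
    return c;
}
```

Because both parties use the same `a`, `b` and `t`, a key in both sets is either in both samples or in neither, which is what makes the intersection of the samples equal to the sample of the intersection.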

The crucial point here is that the sampling from different sets can be done in a distributed
fashion as long as a fixed $h$ and $t$ are shared, coordinating the sampling at all locations. This is used,
e.g., in machine learning, where we can store the samples of many different large sets. When a
new set comes, we sample it, and compare the sample with the stored samples to estimate which
other set it has most in common with. Another cool application of coordinated sampling is on
the Internet where all routers can store samples of the packets passing through [?]. If a packet is
sampled, it is sampled by all routers that it passes, and this means that we can follow the packet's
route through the network. If the routers did not use coordinated sampling, the chance that the
same packet would be sampled at multiple routers would be very small.

Exercise 3.5 Given $S_{h,t}(B)$ and $S_{h,t}(C)$, how would you estimate the size of the symmetric
difference $(B \setminus C) \cup (C \setminus B)$?

Below, in our mathematical reasoning, we only talk about the sample $S_{h,t}(A)$ from a single set
$A$. However, as described above, in many applications, $A$ represents a union $B \cup C$ or intersection
$B \cap C$ of different sets $B$ and $C$.

To get a fixed sampling probability $t/m$ for each $x \in U$, we only need that $h : U \to [m]$ is
uniform. This ensures that the estimate $|S_{h,t}(A)| \cdot m/t$ of $|A|$ is unbiased, that is,
$\mathrm{E}[|S_{h,t}(A)| \cdot m/t] = |A|$. The reason that we also want the pair-wise independence of strong universality is that we
want $|S_{h,t}(A)|$ to be concentrated around its mean $|A| \cdot t/m$ so that we can trust the estimate
$|S_{h,t}(A)| \cdot m/t$ of $|A|$.

For $a \in A$, let $X_a$ be the indicator variable for $a$ being sampled, that is, $X_a = [h(a) < t]$. Then
$|S_{h,t}(A)| = X = \sum_{a \in A} X_a$.

Since $h$ is strongly universal, for any distinct keys $a, b \in A$, $h(a)$ and $h(b)$ are independent,
but then $X_a$ and $X_b$ are also independent random variables. The $X_a$ are thus pairwise independent,
so $\mathrm{Var}[X] = \sum_{a \in A} \mathrm{Var}[X_a]$. Let $p = t/m$ be the common sampling probability for all keys. Also,
let $\mu = p|A| = \mathrm{E}[X]$ be the expectation of $X$. For each $X_a$, we have $\mathrm{E}[X_a] = p$, so
$\mathrm{Var}[X_a] = p(1 - p)$. It follows that $\mathrm{Var}[X] = \mu(1 - p) \le \mu$. By definition, the standard deviation of $X$ is

$$\sigma_X = \sqrt{\mathrm{Var}[X]} = \sqrt{\mu(1 - p)}. \tag{10}$$
Now, by Chebyshev's inequality (see, e.g., [?, Theorem 3.3]), for any $q > 0$,

$$\Pr[|X - \mu| \ge q\sigma_X] \le 1/q^2. \tag{11}$$
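
As a worked instance of (10) and (11), with numbers chosen here purely for illustration (a set of one million keys sampled at rate $1/10$, and $q = 10$):

```latex
\begin{align*}
|A| &= 10^6, \qquad p = t/m = 1/10, \qquad \mu = p|A| = 10^5,\\
\sigma_X &= \sqrt{\mu(1-p)} = \sqrt{10^5 \cdot 0.9} = \sqrt{90\,000} = 300,\\
\Pr\bigl[|X - \mu| \ge 10\,\sigma_X\bigr]
  &= \Pr\bigl[|X - \mu| \ge 3000\bigr] \le 1/10^2 = 1/100.
\end{align*}
```

So with probability at least 99%, the sample size is within 3% of its mean, and the estimate of $|A|$ has the same relative error.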


Exercise 3.6 Suppose that $|A| = 100{,}000{,}000$ and $p = t/m = 1/100$. Then $\mathrm{E}[X] = \mu = 1{,}000{,}000$.
Give an upper bound for the probability that $|X - \mu| \ge 10{,}000$. These numbers
correspond to a 1% sampling rate and a 1% error.

3.2 Multiply-mod-prime 

The classic strongly universal hashing scheme is a multiply-mod-prime scheme. For some prime
$p$, uniformly at random we pick $(a, b) \in [p]^2$ and define $h_{a,b} : [p] \to [p]$ by

$$h_{a,b}(x) = (ax + b) \bmod p. \tag{12}$$

To see that this is strongly universal, consider distinct keys $x, y \in [p]$ and possibly non-distinct
hash values $q, r \in [p]$, with $h_{a,b}(x) = q$ and $h_{a,b}(y) = r$. This is exactly as in (3) and (4), and by
Lemma 2.2, we have a 1-1 correspondence between pairs $(a, b) \in [p] \times [p]$ and pairs $(q, r) \in [p]^2$.
Since $(a, b)$ is uniform in $[p]^2$ it follows that $(q, r)$ is uniform in $[p]^2$, hence that the pair-wise event
$h_{a,b}(x) = q$ and $h_{a,b}(y) = r$ happens with probability $1/p^2$.
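
The 1-1 correspondence behind this argument can be checked by brute force for a small modulus. The sketch below is a sanity check only; the name `pairs_hit_exactly_once` and the limit $p \le 64$ are assumptions of this illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* For a modulus p (2 <= p <= 64 in this sketch), checks that for all
   distinct keys x, y in [p], every pair (q, r) of hash values is produced
   by exactly one seed pair (a, b) under h_{a,b}(x) = (a*x + b) mod p. */
bool pairs_hit_exactly_once(uint32_t p) {
    if (p < 2 || p > 64) return false;
    for (uint32_t x = 0; x < p; x++)
        for (uint32_t y = 0; y < p; y++) {
            if (x == y) continue;
            uint16_t count[64][64] = {{0}};
            for (uint32_t a = 0; a < p; a++)
                for (uint32_t b = 0; b < p; b++)
                    count[(a * x + b) % p][(a * y + b) % p]++;
            for (uint32_t q = 0; q < p; q++)
                for (uint32_t r = 0; r < p; r++)
                    if (count[q][r] != 1) return false;
        }
    return true;
}
```

For a prime such as 7 or 13 the check succeeds, as Lemma 2.2 predicts. For a composite modulus such as 8 it fails, because Fact 2.1 no longer holds.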

Exercise 3.7 Let $m \le p$. For random $(a, b) \in [p]^2$, define the hash function $h_{a,b} : [p] \to [m]$ by

$$h_{a,b}(x) = ((ax + b) \bmod p) \bmod m.$$

(a) Argue that $h_{a,b}$ is strongly 2-universal.

(b) In the universal multiply-mod-prime hashing from Section 2.2, we insisted on $a \ne 0$, but now
we consider all $a \in [p]$. Why this difference?

3.3 Multiply-shift 

We now present a simple generalization from [?] of the universal multiply-shift scheme from
Section 2.3 that yields strong universality. As a convenient notation, for any bit-string $z$ and integers
$j > i \ge 0$, $z[i, j) = z[i, j-1]$ denotes the number represented by bits $i, \ldots, j-1$ (bit 0 is the
least significant bit, which, confusingly, happens to be rightmost in the standard representation), so

$$z[i, j) = \lfloor (z \bmod 2^j) / 2^i \rfloor.$$
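
For 64-bit values, the bit-range notation corresponds to a mask followed by a shift. A minimal sketch, with the hypothetical helper name `bits`:

```c
#include <stdint.h>

/* z[i, j): the number formed by bits i, ..., j-1 of z, for 0 <= i < j <= 64. */
uint64_t bits(uint64_t z, unsigned i, unsigned j) {
    /* z mod 2^j; j = 64 is handled separately because shifting a
       64-bit value by 64 is undefined in C. */
    uint64_t low = (j == 64) ? z : (z & ((UINT64_C(1) << j) - 1));
    return low >> i;   /* floor division by 2^i */
}
```

For example, 44 is 101100 in binary, so bits 2, 3 and 4 are 1, 1 and 0, and `bits(44, 2, 5)` is 3.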

To hash $[2^w] \to [2^\ell]$ we may pick any $\bar{w} \ge w + \ell - 1$. For any pair $(a, b) \in [2^{\bar{w}}]^2$, we define
$h_{a,b} : [2^w] \to [2^\ell]$ by

$$h_{a,b}(x) = (ax + b)[\bar{w} - \ell, \bar{w}). \tag{13}$$

As for the universal multiply-shift, we note that the scheme of (13) is easy to implement with
convenient parameter choices, e.g., with $\bar{w} = 64$, $w = 32$ and $\ell = 20$, we get the C-code:
#include <stdint.h>
// defines uint32/64_t as unsigned 32/64-bit integer.
uint64_t hash(uint32_t x, uint32_t l, uint64_t a, uint64_t b) {
    // hashes x strongly universally into l bits
    // using the random seeds a and b.
    return (a*x+b) >> (64-l);
}

We will prove that the scheme from (13) is strongly universal. In the proof we will reason a lot
about uniformly distributed variables, e.g., if $X \in [m]$ is uniformly distributed and $\beta$ is a constant
integer, then $(X + \beta) \bmod m$ is also uniformly distributed in $[m]$. More interestingly, we have

Fact 3.2 Consider two positive integers $\alpha$ and $m$ that are relatively prime, that is, $\alpha$ and $m$ have
no common prime factor. If $X$ is uniform in $[m]$, then $(\alpha X) \bmod m$ is also uniformly distributed
in $[m]$. Important cases are (a) if $\alpha < m$ and $m$ is prime, and (b) if $\alpha$ is odd and $m$ is a power of
two.

Proof We want to show that for every $y \in [m]$ there is at most one $x \in [m]$ such that
$(\alpha x) \bmod m = y$, for then there must be exactly one $x \in [m]$ for each $y \in [m]$, and vice versa. Suppose we
had distinct $x_1, x_2 \in [m]$ such that $(\alpha x_1) \bmod m = y = (\alpha x_2) \bmod m$. Then
$\alpha(x_2 - x_1) \bmod m = 0$, so $m$ is a divisor of $\alpha(x_2 - x_1)$. By the fundamental theorem of arithmetic, every positive
integer has a unique prime factorization, so all prime factors of $m$ have to be factors of $\alpha(x_2 - x_1)$
in same or higher powers. Since $m$ and $\alpha$ are relatively prime, no prime factor of $m$ is a factor of $\alpha$,
so the prime factors of $m$ must all be factors of $x_2 - x_1$ in same or higher powers. Therefore $m$
must divide $x_2 - x_1$, contradicting the assumption $x_1 \not\equiv x_2 \pmod m$. Thus, as desired, for any
$y \in [m]$, there is at most one $x \in [m]$ such that $(\alpha x) \bmod m = y$. ■


Theorem 3.3 When $a, b \in [2^{\bar{w}}]$ are uniform and independent, then the multiply-shift scheme from
(13) is strongly universal.

Proof Consider any distinct keys $x, y \in [2^w]$. We want to show that $h_{a,b}(x)$ and $h_{a,b}(y)$ are
independent uniformly distributed variables in $[2^\ell]$.

Let $s$ be the index of the least significant 1-bit in $(y - x)$ and let $z$ be the odd number such
that $(y - x) = z 2^s$. Since $z$ is odd and $a$ is uniform in $[2^{\bar{w}}]$, by Fact 3.2 (b), we have that $az \bmod 2^{\bar{w}}$ is
uniform in $[2^{\bar{w}}]$. Now $a(y - x) = az 2^s$ has all 0s in bits $0, \ldots, s-1$ and a uniform distribution on
bits $s, \ldots, s + \bar{w} - 1$. The latter implies that $a(y - x)[s, \bar{w})$ is uniformly distributed in $[2^{\bar{w}-s}]$.

Consider now any fixed value of $a$. Since $b$ is still uniform in $[2^{\bar{w}}]$, we get that $(ax + b)[0, \bar{w})$
is uniformly distributed, implying that $(ax + b)[s, \bar{w})$ is uniformly distributed. This holds for any
fixed value of $a$, so we conclude that $(ax + b)[s, \bar{w})$ and $a(y - x)[s, \bar{w})$ are independent random
variables, each uniformly distributed in $[2^{\bar{w}-s}]$.

Now, since $a(y - x)[0, s) = 0$, we get that

$$(ay + b)[s, \infty) = ((ax + b) + a(y - x))[s, \infty) = (ax + b)[s, \infty) + (a(y - x))[s, \infty).$$

The fact that $a(y - x)[s, \bar{w})$ is uniformly distributed independently of $(ax + b)[s, \bar{w})$ now implies that
$(ay + b)[s, \bar{w})$ is uniformly distributed independently of $(ax + b)[s, \bar{w})$. However, $\bar{w} \ge w + \ell - 1$ and
$s < w$, so $s \le w - 1 \le \bar{w} - \ell$. Therefore $h_{a,b}(x) = (ax + b)[\bar{w} - \ell, \bar{w})$ and $h_{a,b}(y) = (ay + b)[\bar{w} - \ell, \bar{w})$
are independent uniformly distributed variables in $[2^\ell]$. ■
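
Theorem 3.3 can be sanity-checked by brute force for tiny parameters. The sketch below assumes toy sizes such as $w = 4$, $\bar{w} = 5$, $\ell = 2$ and counts, for each pair of distinct keys, how many seed pairs produce each pair of hash values; strong universality means every count equals $2^{2\bar{w}} / 2^{2\ell}$. The function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* (13) with small word sizes: bits wbar-l, ..., wbar-1 of a*x + b. */
static uint32_t sms_small(uint32_t x, uint32_t a, uint32_t b,
                          unsigned wbar, unsigned l) {
    uint32_t mask = (1u << wbar) - 1u;
    return ((a * x + b) & mask) >> (wbar - l);
}

/* Returns true if, for all distinct x, y in [2^w], each pair (q, r) of
   hash values arises from exactly (2^wbar / 2^l)^2 seed pairs (a, b).
   Intended for tiny parameters only: w <= 8, wbar <= 8, l <= 4. */
bool strongly_universal_small(unsigned w, unsigned wbar, unsigned l) {
    uint32_t keys = 1u << w, seeds = 1u << wbar, m = 1u << l;
    uint32_t expected = (seeds / m) * (seeds / m);
    for (uint32_t x = 0; x < keys; x++)
        for (uint32_t y = 0; y < keys; y++) {
            if (x == y) continue;
            uint32_t count[16][16] = {{0}};
            for (uint32_t a = 0; a < seeds; a++)
                for (uint32_t b = 0; b < seeds; b++)
                    count[sms_small(x, a, b, wbar, l)]
                         [sms_small(y, a, b, wbar, l)]++;
            for (uint32_t q = 0; q < m; q++)
                for (uint32_t r = 0; r < m; r++)
                    if (count[q][r] != expected) return false;
        }
    return true;
}
```

With $\bar{w} = 4 < w + \ell - 1$ the check fails for $w = 4$, $\ell = 2$: the keys 0 and 8 can never have hash values that differ by 1, which illustrates why the extra bits are needed.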

In order to reuse the above proof in more complicated settings, we crystallize a technical lemma
from the last part:

Lemma 3.4 Let $\bar{w} \ge w + \ell - 1$. Consider a random function $g : U \to [2^{\bar{w}}]$ (in the proof of
Theorem 3.3, we would have $U = [2^w]$ and $g(x) = ax[0, \bar{w})$) with the property that for any
distinct $x, y \in U$ there exists a non-negative $s < w$, determined by $x$ and $y$ (and not by $g$, e.g., in the proof of
Theorem 3.3, $s$ was the least significant set bit in $y - x$), such that $(g(y) - g(x))[0, s) = 0$ while
$(g(y) - g(x))[s, \bar{w})$ is uniformly distributed in $[2^{\bar{w}-s}]$. For $b$ uniform in $[2^{\bar{w}}]$ and independent of $g$,
define $h_{g,b} : U \to [2^\ell]$ by

$$h_{g,b}(x) = (g(x) + b)[\bar{w} - \ell, \bar{w}).$$

Then $h_{g,b}(x)$ is strongly universal.


3.4 Vector multiply-shift 

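Our strongly universal multiply-shift scheme generalizes nicely to vector hashing. The goal is to
get strongly universal hashing from $[2^w]^d$ to $[2^\ell]$. With $\bar{w} \ge w + \ell - 1$, we pick independent uniform
$a_0, \ldots, a_{d-1}, b \in [2^{\bar{w}}]$ and define $h_{a_0, \ldots, a_{d-1}, b} : [2^w]^d \to [2^\ell]$ by

$$h_{a_0, \ldots, a_{d-1}, b}(x_0, \ldots, x_{d-1}) = \Bigl(\Bigl(\sum_{i \in [d]} a_i x_i\Bigr) + b\Bigr)[\bar{w} - \ell, \bar{w}). \tag{14}$$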
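
With $\bar{w} = 64$ and $w = 32$, formula (14) becomes a short loop in C. This is a sketch under those assumed parameters; the name `vector_hash` is hypothetical and the seeds are assumed to be independent uniform 64-bit values.

```c
#include <stdint.h>
#include <stddef.h>

/* Vector multiply-shift (14) with wbar = 64 and w = 32: hashes the d
   32-bit coordinates x[0..d) into l bits, 1 <= l <= 33, using random
   64-bit seeds a[0..d) and b. Unsigned overflow wraps, which is the same
   as computing modulo 2^64. */
uint64_t vector_hash(const uint32_t *x, size_t d, unsigned l,
                     const uint64_t *a, uint64_t b) {
    uint64_t sum = b;
    for (size_t i = 0; i < d; i++)
        sum += a[i] * x[i];          /* x[i] is promoted to 64 bits */
    return sum >> (64 - l);          /* keep bits 64-l, ..., 63 */
}
```

The bound $l \le 33$ comes from the requirement $\bar{w} \ge w + \ell - 1$.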


Theorem 3.5 The vector multiply-shift scheme from (14) is strongly universal.

Proof We will use Lemma 3.4 to prove that this scheme is strongly universal. We define
$g : [2^w]^d \to [2^{\bar{w}}]$ by

$$g(x_0, \ldots, x_{d-1}) = \Bigl(\sum_{i \in [d]} a_i x_i\Bigr)[0, \bar{w}).$$

Consider two distinct keys $x = (x_0, \ldots, x_{d-1})$ and $y = (y_0, \ldots, y_{d-1})$. Let $j$ be an index such that
$x_j \ne y_j$ and such that the index $s$ of the least significant set bit of $y_j - x_j$ is as small as possible. Thus $y_j - x_j$
has 1 in bit $s$, and all $i \in [d]$ have $(y_i - x_i)[0, s) = 0$. As required by Lemma 3.4, $s$ is determined
from the keys only. Then

$$(g(y) - g(x))[0, s) = \Bigl(\sum_{i \in [d]} a_i (y_i - x_i)\Bigr)[0, s) = 0$$

regardless of $a_0, \ldots, a_{d-1}$. Next we need to show that $(g(y) - g(x))[s, \bar{w})$ is uniformly distributed
in $[2^{\bar{w}-s}]$. The trick is to first fix all $a_i$, $i \ne j$, arbitrarily, and then argue that $(g(y) - g(x))[s, \bar{w})$ is
uniform when $a_j$ is uniform in $[2^{\bar{w}}]$. Let $z$ be the odd number such that $z 2^s = y_j - x_j$. Also, let $\Delta$
be the constant defined by

$$\Delta 2^s = \sum_{i \in [d],\, i \ne j} a_i (y_i - x_i).$$

Now

$$g(y) - g(x) = (a_j z + \Delta) 2^s.$$

With $z$ odd and $\Delta$ a fixed constant, the uniform distribution on $a_j \in [2^{\bar{w}}]$ implies that
$(a_j z + \Delta) \bmod 2^{\bar{w}}$ is uniform in $[2^{\bar{w}}]$, but then $(a_j z + \Delta) \bmod 2^{\bar{w}-s} = (g(y) - g(x))[s, \bar{w})$ is also uniform
in $[2^{\bar{w}-s}]$. Now Lemma 3.4 implies that the vector multiply-shift scheme from (14) is strongly
universal. ■


Exercise 3.8 Corresponding to the universal hashing from Section 2.3, suppose we tried with $\bar{w} = w$
and just used random odd $a_0, \ldots, a_{d-1} \in [2^w]$ and a random $b \in [2^w]$, and defined

$$h_{a_0, \ldots, a_{d-1}, b}(x_0, \ldots, x_{d-1}) = \Bigl(\Bigl(\sum_{i \in [d]} a_i x_i\Bigr) + b\Bigr)[w - \ell, w).$$

Give an instance showing that this simplified vector hashing scheme is not remotely universal.

Our vector hashing can also be used for universality, where it gives collision probability $1/2^\ell$. As
a small tuning, we could skip adding $b$, but then we would only get the same $2/2^\ell$ bound as we had
in Section 2.3.


3.5 Pair-multiply-shift 

A cute trick from [?] allows us to roughly double the speed of vector hashing, the point being that
multiplication is by far the slowest operation involved. We will use exactly the same parameters
and seeds as for (14). However, assuming that the dimension $d$ is even, we replace (14) by

$$h_{a_0, \ldots, a_{d-1}, b}(x_0, \ldots, x_{d-1}) = \Bigl(\Bigl(\sum_{i \in [d/2]} (a_{2i} + x_{2i+1})(a_{2i+1} + x_{2i})\Bigr) + b\Bigr)[\bar{w} - \ell, \bar{w}). \tag{15}$$

This scheme handles pairs of coordinates $(2i, 2i+1)$ with a single multiplication. Thus, with
$\bar{w} = 64$ and $w = 32$, we handle each pair of 32-bit keys with a single 64-bit multiplication.
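
In C, with $\bar{w} = 64$ and $w = 32$, the pair-multiply-shift scheme (15) differs from the plain vector scheme only in the loop body. The sketch below assumes an even dimension $d$; the name `pair_hash` is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Pair-multiply-shift (15) with wbar = 64 and w = 32: hashes the d 32-bit
   coordinates x[0..d), d even, into l bits, 1 <= l <= 33, using random
   64-bit seeds a[0..d) and b. One multiplication per pair of coordinates. */
uint64_t pair_hash(const uint32_t *x, size_t d, unsigned l,
                   const uint64_t *a, uint64_t b) {
    uint64_t sum = b;
    for (size_t i = 0; i + 1 < d; i += 2)
        sum += (a[i] + x[i + 1]) * (a[i + 1] + x[i]);   /* wraps mod 2^64 */
    return sum >> (64 - l);
}
```

Each loop iteration does one 64-bit multiplication and three additions, compared with two multiplications for the same two coordinates in the plain vector scheme.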

Exercise 3.9 (a bit more challenging) Prove that the scheme defined by (15) is strongly universal.
One option is to prove a tricky generalization of Lemma 3.4 where $(g(y) - g(x))[0, s)$ may not be
0 but can be any deterministic function of $x$ and $y$. With this generalization, you can make a proof
similar to that for Theorem 3.5 with the same definition of $j$ and $s$.

Above we have assumed that $d$ is even. In particular this is the case if we want to hash an array of
64-bit integers, but cast it as an array of 32-bit numbers. If $d$ is odd, we can use the pair-multiplication
for the first $\lfloor d/2 \rfloor$ pairs, and then just add $a_{d-1} x_{d-1}$ to the sum.
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4 String hashing 

4.1 Hashing vector prefixes 

Sometimes what we really want is to hash vectors of length up to $D$ but perhaps smaller. As in the
multiply-shift hashing schemes, we assume that each coordinate is from $[2^w]$. The simple point is
that we only want to spend time proportional to the actual length $d \le D$. With $\bar w \ge w + \ell - 1$,
pick independent uniform $a_0, \ldots, a_D \in [2^{\bar w}]$. For even $d$, we define $h : \bigcup_{\text{even } d \le D} [2^w]^d \to [2^\ell]$
by

$$h_{a_0,\ldots,a_D}(x_0,\ldots,x_{d-1}) = \left(\left(\sum_{i\in[d/2]} (a_{2i} + x_{2i+1})(a_{2i+1} + x_{2i})\right) + a_d\right)[\bar w - \ell, \bar w). \qquad (16)$$


Exercise 4.1 Prove that the above even prefix version of pair-multiply-shift is strongly universal.
In the proof you may assume that the original pair-multiply-shift from (15) is strongly universal,
as you may have proved in Exercise 3.9. Thus we are considering two vectors $x = (x_0, \ldots, x_{d-1})$
and $y = (y_0, \ldots, y_{d'-1})$. You should consider both the case $d' = d$ and $d' \ne d$.

4.2 Hashing bounded length strings 

Suppose now that we want to hash strings of 8-bit characters, e.g., these could be the words in a 
book. Then the nil-character is not used in any of the strings. Suppose that we only want to handle 
strings up to some maximal length, say, 256. 

With the prefix-pair-multiply-shift scheme from (16), we have a very fast way of hashing strings
of $d$ 64-bit integers, casting them as $2d$ 32-bit integers. A simple trick now is to allocate a single
array $x$ of $256/8 = 32$ 64-bit integers. When we want to hash a string $s$ with $c$ characters, we first
set $d = \lceil c/8 \rceil$ (done fast by d=(c+7)>>3). Next we set $x_{d-1} = 0$, and then we do a memory
copy of $s$ into $x$ (using a statement like memcpy(x,s,c)). Finally, we apply (16) to $x$.

Note that we use the same variable array $x$ every time we want to hash a string $s$. Let $s^*$ be the
image of $s$ created as a $c^* = \lceil c/8 \rceil$ length prefix of $x$.
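The steps above can be sketched in C as follows. This is only a sketch under stated assumptions: the names `seed`, `prefix_pair_multiply_shift` and `hash_bounded_string` are hypothetical, the parameters are $\bar w = 64$, $w = 32$, $\ell = 32$, and the buffer is declared directly as 64 32-bit coordinates (the same memory as 32 64-bit integers) to avoid pointer casts. The resulting value depends on the byte order of the machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CHARS 256                  /* maximal string length, as in the text */
#define MAX_COORDS (MAX_CHARS / 4)     /* 64 32-bit coordinates = 32 64-bit integers */

/* Seeds a_0, ..., a_D for (16) with D = MAX_COORDS. Hypothetical global:
 * the caller must fill it with independent random 64-bit values. */
uint64_t seed[MAX_COORDS + 1];

/* Prefix pair-multiply-shift (16) on n 32-bit coordinates, n even. */
static uint32_t prefix_pair_multiply_shift(const uint32_t *x, size_t n)
{
    uint64_t sum = seed[n];            /* the additive seed a_d depends on the length */
    for (size_t i = 0; i < n; i += 2)
        sum += (seed[i] + x[i + 1]) * (seed[i + 1] + x[i]);
    return (uint32_t)(sum >> 32);      /* bits [wbar - l, wbar) */
}

/* Hash a string s of c <= 256 characters, none of them the nil-character. */
uint32_t hash_bounded_string(const char *s, size_t c)
{
    static uint32_t x[MAX_COORDS];     /* reused across calls, as in the text */
    assert(c <= MAX_CHARS);
    size_t d = (c + 7) >> 3;           /* number of 64-bit integers, ceil(c/8) */
    if (d > 0) {
        x[2 * d - 2] = 0;              /* zero the last 64-bit integer x_{d-1}, */
        x[2 * d - 1] = 0;              /* stored here as two 32-bit halves */
    }
    memcpy(x, s, c);                   /* stale data beyond the prefix is never read */
    return prefix_pair_multiply_shift(x, 2 * d);
}
```

Because the buffer is static, this sketch is not safe to call from several threads at once; a per-thread buffer would be one way around that.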

Exercise 4.2 Prove that if $s$ and $t$ are two different strings of length at most 256, neither containing the
nil-character, then their images $s^*$ and $t^*$ are different. Conclude that we now have a strongly
universal hash function for such strings.

Exercise 4.3 Implement the above hash function for strings. Use it in a chaining based hash table, 
and apply it to count the number of distinct words in a text (take any pdf-file and convert it to ascii, 
e.g., using pdf2txt). 

To get the random numbers defining your hash functions, you can go to random.org.

One issue to consider when you implement a hash table is that you want the number $m$ of
entries in the hash array to be at least as big as the number of elements (distinct words), which in our
case is not known in advance. Using a hash table of some start size $m$, you can maintain a count
of the distinct words seen so far, and then double the size when the count reaches, say, $m/2$.
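The doubling strategy could look roughly like the following C sketch of a chaining table. All type and function names here are hypothetical, `word_hash` is only a placeholder and not one of the strongly universal schemes from these notes, and allocation failures are not checked, to keep the example short.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct node { struct node *next; char *word; } node;
typedef struct { node **slots; size_t m; size_t count; } table;

/* Placeholder string hash; substitute the hash function under study. */
static uint32_t word_hash(const char *s)
{
    uint32_t h = 0;
    while (*s) h = h * 31u + (unsigned char)*s++;
    return h;
}

/* Create an empty table with m >= 1 chains. */
table *table_new(size_t m)
{
    table *t = malloc(sizeof *t);
    t->slots = calloc(m, sizeof *t->slots);
    t->m = m;
    t->count = 0;
    return t;
}

/* Double the number of chains and rehash every stored word. */
static void table_grow(table *t)
{
    size_t m2 = 2 * t->m;
    node **s2 = calloc(m2, sizeof *s2);
    for (size_t i = 0; i < t->m; i++) {
        node *n = t->slots[i];
        while (n) {
            node *next = n->next;
            size_t j = word_hash(n->word) % m2;
            n->next = s2[j];
            s2[j] = n;
            n = next;
        }
    }
    free(t->slots);
    t->slots = s2;
    t->m = m2;
}

/* Returns 1 if word was new (and is now stored), 0 if already present. */
int table_insert(table *t, const char *word)
{
    size_t i = word_hash(word) % t->m;
    for (node *n = t->slots[i]; n; n = n->next)
        if (strcmp(n->word, word) == 0) return 0;
    node *n = malloc(sizeof *n);
    size_t len = strlen(word) + 1;
    n->word = malloc(len);
    memcpy(n->word, word, len);
    n->next = t->slots[i];
    t->slots[i] = n;
    if (++t->count >= t->m / 2) table_grow(t);   /* double at count = m/2 */
    return 1;
}
```

After feeding all words of a text through `table_insert`, the field `count` holds the number of distinct words.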

Many ideas can be explored for optimization, e.g., if we are willing to accept a small
false-positive probability, we can replace each word with a 32- or 64-bit hash value, saying that a word
is new only if it has a new hash value.

Experiment with some different texts: different languages and different lengths. What happens
with the vocabulary?

The idea now is to check how much time is spent on the actual hashing, as compared with the
real code that both does the hashing and follows the chains in the hash array. However, if we just
compute the hash values and don't use them, then some optimizing compilers will notice and just
do nothing. You should therefore add up all the hash values, and output the result, just to force the
compiler to do the computation.
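A timing loop along these lines might look as follows in C. The names `toy_hash` and `time_hashing` are made up for this sketch, and `toy_hash` is just a stand-in multiply-shift with a fixed odd multiplier; in an experiment it would be replaced by the hash function being measured.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Stand-in hash: multiply-shift with a fixed odd 64-bit multiplier. */
static inline uint32_t toy_hash(uint64_t x)
{
    return (uint32_t)((0x9E3779B97F4A7C15ULL * x) >> 32);
}

/* Hash keys[0..n-1], add up the hash values, and print the sum so that
 * the compiler cannot discard the hashing work. Stores the CPU time
 * spent in *seconds and returns the sum. */
uint64_t time_hashing(const uint64_t *keys, size_t n, double *seconds)
{
    clock_t start = clock();
    uint64_t checksum = 0;
    for (size_t i = 0; i < n; i++)
        checksum += toy_hash(keys[i]);
    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("checksum %llu, %.3f s\n", (unsigned long long)checksum, *seconds);
    return checksum;
}
```

The same loop with the table lookups added gives the second measurement, so the difference between the two times indicates how much is spent following chains.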

4.3 Hashing variable length strings 

We now consider the hashing of a variable length string $x_0 x_1 \cdots x_d$ where all characters belong
to some domain $[u]$.

We use the method from [?], which first picks a prime $p \ge u$. The idea is to view $x_0, \ldots, x_d$ as
coefficients of a degree $d$ polynomial

$$P_{x_0,\ldots,x_d}(\alpha) = \sum_{i=0}^{d} x_i \alpha^i$$

over $\mathbb{Z}_p$. As seed for our hash function, we pick an argument $a \in [p]$, and compute the hash
function

$$h_a(x_0 \cdots x_d) = P_{x_0,\ldots,x_d}(a).$$

Consider some other string $y = y_0 y_1 \cdots y_{d'}$, $d' \le d$. We claim that

$$\Pr_{a \in [p]}\left[h_a(x_0 \cdots x_d) = h_a(y_0 \cdots y_{d'})\right] \le d/p.$$


The proof is very simple. By definition, the collision happens only if $a$ is a root of the polynomial
$P_{y_0,\ldots,y_{d'}} - P_{x_0,\ldots,x_d}$. Since the strings are different, this polynomial is not the constant zero.
Moreover, its degree is at most $d$. Since $\mathbb{Z}_p$ is a field, a non-zero polynomial of degree at most $d$
has at most $d$ distinct roots, and the probability that a random $a \in [p]$ is among these
roots is at most $d/p$.

Now, for a fast implementation using Horner's rule, it is better to reverse the order of the
coefficients, and instead use the polynomial

$$P_{x_0,\ldots,x_d}(\alpha) = \sum_{i=0}^{d} x_{d-i} \alpha^i \bmod p.$$

Then we compute $P_{x_0,\ldots,x_d}(a)$ using the recurrence




• $H_a^0 = x_0$

• $H_a^i = (a H_a^{i-1} + x_i) \bmod p$

• $P_{x_0,\ldots,x_d}(a) = H_a^d$.


With this recurrence, we can easily update the hash value if a new character $x_{d+1}$ is added to the end
of the string. It only takes an addition and a multiplication modulo $p$. For speed, we would
let $p$ be a Mersenne prime, e.g. $2^{89} - 1$.

Recall from the discussion in Section 2.2.1 that the multiplication modulo a prime like $2^{89} - 1$
is a bit complicated to implement.
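To make the recurrence concrete, here is a small C sketch. To keep all arithmetic within 64-bit integers it swaps in the much smaller Mersenne prime $2^{31} - 1$ and 8-bit characters, which is an assumption of this example and not the $2^{89} - 1$ suggested above; the resulting collision bound $d/p$ is correspondingly weaker. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define P31 ((1ULL << 31) - 1)   /* Mersenne prime 2^31 - 1, illustrative only */

/* Reduce v < 2^62 modulo 2^31 - 1 without division, using that
 * 2^31 = 1 (mod p): the high bits can simply be added to the low bits. */
static uint64_t mod_p31(uint64_t v)
{
    v = (v & P31) + (v >> 31);   /* now v < 2^32 */
    v = (v & P31) + (v >> 31);   /* now v <= p + 1 */
    return v >= P31 ? v - P31 : v;
}

/* One step of the recurrence: H^i = (a * H^{i-1} + x_i) mod p.
 * Requires a < p and h < p, so that a * h + xi < 2^62. */
uint64_t poly_hash_append(uint64_t a, uint64_t h, uint8_t xi)
{
    return mod_p31(a * h + xi);
}

/* Hash the characters x[0..n-1], n >= 1, with seed a in [p]. */
uint64_t poly_hash(uint64_t a, const uint8_t *x, size_t n)
{
    assert(a < P31 && n >= 1);
    uint64_t h = x[0];                       /* H^0 = x_0 */
    for (size_t i = 1; i < n; i++)
        h = poly_hash_append(a, h, x[i]);
    return h;
}
```

Appending a character to an already hashed string is a single call to `poly_hash_append` on the stored value, which is the incremental update described above.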

The collision probability $d/p$ may seem fairly large, but assume that we only want hash values
in the range $m \le p/d$, e.g., for $m = 2^{32}$ and $p = 2^{89} - 1$, this would allow for strings of length
up to $2^{57}$, which is big enough for most practical purposes. Then it suffices to compose the string
hashing with a universal hash function from $[p]$ to $[m]$. Composing with the previous
multiply-mod-prime scheme, we end up using three seeds $a, b, c \in [p]$, and then compute the hash function
as

$$h_{a,b,c}(x_0, \ldots, x_d) = \left(\left(a \left(\sum_{i=0}^{d} x_i c^i\right) + b\right) \bmod p\right) \bmod m.$$


Exercise 4.4 Assuming that strings have length at most p/m, argue that the collision probability 
is at most 2/m. 

Above we can let $u$ be any value bounded by $p$. With $p = 2^{89} - 1$, we could use $u = 2^{64}$, thus
dividing the string into 64-bit characters.

Exercise 4.5 Implement the above scheme and run it to get a 32-bit signature of a book. 

Major speed-up The above code is slow because of the multiplications modulo Mersenne primes, 
one for every 64 bits in the string. 

An idea for a major speed-up is to divide your string into chunks $X_0, \ldots, X_j$ of 32 integers
of 64 bits, the last chunk possibly being shorter. We want a single universal hash function
$r : \bigcup_{d \le 32} [2^{64}]^d \to [2^{64}]$. A good choice would be to use our strongly universal pair-multiply-shift
scheme from (16). It only outputs 32-bit numbers, but if we use two different such functions, we
can concatenate their hash values in a single 64-bit number.

Exercise 4.6 Prove that if $r$ has collision probability $P$, and if $(X_0, \ldots, X_j) \ne (Y_0, \ldots, Y_{j'})$, then

$$\Pr\left[(r(X_0), \ldots, r(X_j)) = (r(Y_0), \ldots, r(Y_{j'}))\right] \le P.$$

The point is that $r(X_0), \ldots, r(X_j)$ is 32 times shorter than $X_0, \ldots, X_j$.
We can now apply our slow variable length hashing based on Mersenne primes to the reduced
string $r(X_0), \ldots, r(X_j)$. This only adds $P$ to the overall collision probability.

Exercise 4.7 Implement the above tuning. How much faster is your hashing now? 
Splitting between short and long strings When implementing a generic string hashing code,
we do not know in advance if it is going to be applied mostly to short or to long strings. Our
scheme for bounded length strings from Section 4.2 is faster than the generic scheme presented
above for variable length strings. In practice it is a good idea to implement both: have the scheme
from Section 4.2 implemented for strings of length up to some $d$, e.g., $d$ could be 32 64-bit integers
as in the above blocks, and then only apply the generic scheme if the length is above $d$.

Major open problem Can we get something simple and fast like multiply-shift to work directly
for strings, so that we do not need to compute polynomials over prime fields?